Haplotype-resolved assemblies and variant benchmark of a Chinese Quartet

Background Recent state-of-the-art sequencing technologies enable the investigation of challenging regions in the human genome and expand the scope of variant benchmarking datasets. Herein, we sequence a Chinese Quartet, comprising two monozygotic twin daughters and their biological parents, using four short and long sequencing platforms (Illumina, BGI, PacBio, and Oxford Nanopore Technology). Results The long reads from the monozygotic twin daughters are phased into paternal and maternal haplotypes using the parent–child genetic map and for each haplotype. We also use long reads to generate haplotype-resolved whole-genome assemblies with completeness and continuity exceeding that of GRCh38. Using this Quartet, we comprehensively catalogue the human variant landscape, generating a dataset of 3,962,453 SNVs, 886,648 indels (< 50 bp), 9726 large deletions (≥ 50 bp), 15,600 large insertions (≥ 50 bp), 40 inversions, 31 complex structural variants, and 68 de novo mutations which are shared between the monozygotic twin daughters. Variants underrepresented in previous benchmarks owing to their complexity—including those located at long repeat regions, complex structural variants, and de novo mutations—are systematically examined in this study. Conclusions In summary, this study provides high-quality haplotype-resolved assemblies and a comprehensive set of benchmarking resources for two Chinese monozygotic twin samples which, relative to existing benchmarks, offers expanded genomic coverage and insight into complex variant categories. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-03116-3.

1 1 Generation of sequencing data

Library preparation and sequencing of HiFi reads
SMRTbell target size libraries were constructed for sequencing according to PacBio's standard protocol (Pacific Biosciences, CA, USA) using 15kb preparation solutions.
The main steps for library preparation are: (1) gDNA shearing, (2) DNA damage repair, end repair and A-tailing, (3) ligation with hairpin adapters from the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences), (4) nuclease treatment of SMRTbell library with SMRTbell Enzyme Cleanup Kit, (5) size selection, and (6) binding to polymerase.Briefly, a total amount of 15 µg DNA per sample was used for the DNA library preparations.The genomic DNA sample was sheared by gTUBEs (Covaris, USA) according to the expected size of the fragments for the library Single-strand overhangs were then removed, and DNA fragments were damage repaired, end repaired and A-tailed.The fragments were then ligated with the hairpin adaptor for PacBio sequencing.The library was then treated with nuclease using SMRTbell Enzyme Cleanup Kit and purified using AMPure PB beads.Target fragments were screened by BluePippin (Sage Science, USA) with the Agilent 2100 Bioanalyzer (Agilent technologies, USA) used to detect the size of library fragments.Sequencing was performed on a PacBio Sequel II instrument with Sequencing Primer V2 and Sequel II Binding Kit 2.0 at the Genome Centre of Grandomics (Wuhan, China).

Library preparation and sequencing of regular ONT reads
A total amount of 3-4 µg DNA per sample was used as input material for the ONT library preparations.After the sample was qualified, size-selection of long DNA fragments was performed using the Pippin HT system (Sage Science, USA).The ends of the DNA fragments were then repaired, and A-ligation reactions were conducted with the NEB Next Ultra II End Repair/dA-tailing Kit (Cat# E7546).The adapter in the SQK-LSK109 kit (Oxford Nanopore Technologies, UK) was used for further ligation reactions.The DNA library was quantified using a Qubit® 4.0 Fluorometer (Invitrogen, USA).Approximately 700ng DNA libraries were constructed and sequenced using a Nanopore PromethION (Oxford Nanopore Technologies, UK) at the Genome Centre of GrandOmics (Wuhan, China).

Library preparation and sequencing of ultra-long ONT reads
For each ultra-long Nanopore library, approximately 8-10 µg of gDNA was size selected (>50 kb) with the SageHLS HMW library system (Sage Science, USA), and processed using the Ligation Sequencing 1D kit (SQK-LSK109, Oxford Nanopore Technologies, UK) according to the manufacturer' s instructions.Approximately 800ng DNA libraries were constructed and sequenced using a the PromethION (Oxford Nanopore Technologies, UK) at the Genome Centre of GrandOmics (Wuhan, China).
For each sequencing technology, the proportion of reads which were phased is given in Supplementary Table S2.

Performance evaluation
We used the reference-based approach (described in Section 2.1) to split the children's reads into two haplotypes according to their parents' reads.To compare the performance in this step with a reference-free phasing method, we also split the reads with canu [37] (v2.2) using the commands: "canu useGrid=false -haplotype -p output -d genomeSize=3.1g-haplotypePaternal father.fq.gz -haplotypeMaternal mother.fq.gzpacbio-hifi child2.fq.ga child2.fq.gz".We found that our reference-based approach could phase 76.2% of the HiFi reads, 64.4% of the regular ONT reads, and 70.5% of the ultra-long ONT reads, while canu could phase 65.3% of the HiFi reads, 79.9% of the regular ONT reads, and 83.0% of the ultra-long ONT reads (Supplementary Table S2).
To further evaluate the phasing accuracy, we also split the unphased reads into two haplotypes randomly.We then used the GenomeScope [100] pipeline to evaluate the heterozygosity rate of each haplotype, considering a lower heterozygosity rate to indicate a higher phasing accuracy.The steps taken and parameters were: jellyfish count -C -m 23 -s 2G -t 24 -o output.jsjellyfish histo -t 24 output.js> output.histRscript genomescope.R output.hist23 10000 output_dir Our reference-based phasing method achieved a phasing rate of 76.19% for HiFi reads, which is higher than the 65.34% achieved by canu.Moreover, our method demonstrated lower heterozygosity rates for paternal (0.0243-0.0251) and maternal (0.0235-0.0244) reads, whereas canu had higher heterozygosity rates for both paternal (0.3129-0.3185) and maternal (0.2819-0.2871) reads (Supplementary Table S2).These results indicate the superior phasing accuracy of our reference-based method.For ONT reads, although our method obtained a slightly lower phasing rate than canu, it also achieved higher phasing accuracy.In summary, under the promise of high phasing rate, our reference-based approach achieved a higher phasing rate.In summary, both our reference-based method and canu achieved comparable phasing rates, but our reference-based method achieved higher phasing accuracy.
Compared to reference-free phasing approaches, the reference-based method may be negatively affected by reads that cannot be mapped or that have low mapping quality.

Assembly parameters
In this study, we merged the reads of the two twin samples for each haplotype and assembled each haplotype with four assemblers.Specifically, we assembled HiFi reads

Evaluation of complex regions
Complex regions in the diploid human genome, such as segmental duplications, are challenging to validate through PCR or other biology experiments.Hence, we extracted the feature of HiFi read depth to evaluate the structure of the genome in these complex regions.We first aligned reads from paternal and maternal haplotypes to their corresponding genomes as well as to both GRCh38 and CHM13-T2T using minimap2 with default parameters.We considered that incorrect alignment in repetitive regions could affect the read depth, and that accordingly the read depth in these regions when aligned against CHM13-T2T could be used as a baseline.To detect abnormal read depths, we then randomly selected 1000 bins of 1000bp across each autosome, and defined the expected ranges of the read depth according to the interquartile range (IQR) of read depths in these bins.Depths lower than Q1-1.5*IQR or larger than Q3+1.5*IQR were defined as abnormal depths, where Q1 is the first quartile of the depth, Q3 is the third quartile, and IQR is Q3-Q1.The ratio of abnormal bins to total bin in the regions is calculated to evaluate the genome structure.For example, we found that the ratio of abnormal bins in our assemblies was comparable to CHM13-T2T in chr17:21,523,754-22371820, which suggested our genome achieved comparable quality with CHM13-T2T in this region (Supplementary Fig. 6).However, if we aligned the reads of our assemblies to GRCh38 and CHM13-T2T, the abnormal bins were greater those of the baseline.

Gap filling and genome polishing of Chinese Quartet genome
To improve the continuity of the assemblies, we scaffolded the hifiasm contigs using ragtag [71], guided by CHM13-T2T [31].Next, we utilized the contigs produced by HiCanu, Flye, and shasta to fill in the gaps of the hifiasm scaffolds, which include three steps (https://github.com/PengJia6/gapless).
Firstly, the scaffold was split into 100bp k-mers, and only k-mers within 200kb upstream and downstream of gaps were retained as candidate anchors.The candidate anchors were aligned to the scaffold using bwa mem (v0.7.17) and those k-mers with mapping quality more than 30 and without any mismatch were finally retained as anchors for next step.
Next, we aligned the anchors from the scaffold to the contigs to obtain potential gap sequences.We used bwa mem (v0.7.17) to align anchors to contigs, and anchors with mapping quality more than 30 were retained.We grouped the upstream and downstream anchors by contigs.For each gap, contigs with more than 10% upstream and downstream anchors of the scaffold were retained for gap filling.For each contig, its score for gap filling is calculated by the ratio of anchors in the contig to anchors of scaffolds.
Thirdly, we selected the contig with the greatest score to fill the gap.We selected the two closest anchors between upstream and downstream in a contig as representatives.
The sequence between the two anchors in the contig was extracted to replace corresponding sequences in scaffold.

Variant filtering with Mendelian rule
In this study, we evaluated the Illumina and HiFi variant calls of the twins in the Chinese Quartet using Mendelian rules.Variants that followed Mendelian inheritance were included in the germline variant calls, while those that violated the rules were classified as error calls, de novo mutations, or somatic mutations.
The pseudo code of Mendelian rule filtering are as follows:
SVision calls were generated using the "--min_sv_size 30 -s10".For each caller, variants of four samples were filtered according to the read depth, allele depth, and Mendelian rule.Finally, only variants supported by at least two callers were kept in the following analysis.The SV merging process is same as the description in section 4.3.1.

Structural variant detection by haplotype-resolved assemblies
We assembled ONT reads using shasta and flye and HiFi reads using hifiasm, hicanu and flye, obtaining five haplotype-resolved assemblies for each haplotype.Then, we discovered structural variants with five haplotype-resolved assemblies using PAV pipelines.Only variant supported by at least three callsets was kept in following steps.
We used the PAV pipeline (https://github.com/EichlerLab/pav,v1.1.0)released by HGSVC to call variants from haplotype resolved assemblies.Notably, the PAV pipeline was developed with snakemake, and we directly used the default parameters to generate phased variants in this study.Then, the variants were split according to their variant types.For each type of SV (Deletion, Insertion, and Inversion), the variants from five assemblies were merged by Jasmine [81] with the --output_genotypes min_overlap=0.5 setting.The variants with at least three assemblies supported in Jasmine merged files were kept in the following analysis.

Complex variant and inversion detection
To discover the complex structural variants in our benchmark, we obtained complex variants from five callsets.HiFi reads were applied to pbsv (v2.6.2),Sniffles [39] (v1.0.12),CuteSV [40] (v1.0.11),and SVision (v1.3.6) [23], and SVs labeled as Inversion, CSV, and multiple types were extracted as candidates.The inversion of HRA calls were also extracted as candidates.For manual curation, the sequencing alignments of Illumina reads, HiFi reads, and HRAs in candidate regions were visualized by IGV [82].The dotplot between HRAs and the reference genome of candidate regions was generated by Gepard [83].We then manually inspected all dotplots and IGV snapshots associated with a reported CVS locus (Supplementary Table S10, Supplementary File 1).Dotplots were used to show the background of the reference genome in the region as well as the differences between our assemblies and the reference genome.IGV was used to display the alignment of Illumina reads, HiFi reads, and haplotype-resolved assemblies in the candidate variants.

Detection of de novo and putative somatic mutations
We utilized the Illumina and HiFi reads of four samples to detect de novo and somatic mutations in the twins.To obtain high-quality germline calls, de novo and putative somatic calls, the "Mendelian rule" filtering in this study included two aspects.First, variants with inconsistent genotypes in the monozygotic twin daughters were removing in the germline call set.The variants were also removed when the genotypes of the daughters contradicted those of their parents.To identify the de novo and somatic mutation, the variants shared by both twins are included in the candidate set for de novo mutation, while the variants specific to one twin daughter are included in the candidate set for putative somatic mutation.To further reduce the false positives, variants located in repeat regions, including STR, VNTR, SD, and SM, are excluded.

Benchmarking regions and high-confidence regions definition
We added "benchmark regions" and "high-confidence regions (Tier1)" for false positive identification.The "benchmark regions" covered all variants, while the "highconfidence regions" contained variants supported by at least two technologies or validated by an independent technology.
To define the benchmark regions, we first mapped the two haplotypes of assemblies to the GRCh38, and regions covered by both haplotypes were kept.were removed.
In this study design, HiFi reads and haplotype-resolved assemblies facilitated the detection of variants in complex regions, including simple repeats (SR), segmental duplications (SD), and variable number tandem repeats (VNTR), while short reads are not accessible well.Thus, variants in our benchmark were divided into Tier1 (highconfidence) and Tier 2 call sets according to whether they were in repeat regions and their supporting technologies.
For SVs, high-confidence calls (Tier 1) and corresponding regions of v2.1 of the benchmark were obtained according to the following steps: a) Initially, we identified candidate high-confidence SVs within benchmark regions.This included all SVs supported by two or more technologies.For technology-specific SVs, we considered the following three criteria to classify them as candidate high-confidence SVs: i) They were not located within segmental duplications, VNTR (variable number tandem repeat), or simple repeat regions.
ii) They were not reported through short-read analysis.
iii) They were reported in an orthogonal callset.
b) Next, we refined the benchmark regions to create Tier 1 (v2.1)regions by excluding the following: i) Regions spanning SVs that did not meet the criteria in step a) were excluded.
ii) Non-diploid regions were excluded.
iii) Simple repeats that were overlap with variants not meet the criteria in step a) were excluded.
c) The candidate high-confidence SVs fully within the Tier 1 regions were defined as high-confidence SVs.
For small variants, high-confidence calls (Tier 1) of v2.1 were obtained according to the three steps: a) Initially, we identified candidate high-confidence small variants within benchmark regions.This included all small variants supported by all three technologies.For variants supported by two technologies, we considered those were not located within segmental duplications, VNTR, simple repeat, or short tandem repeat regions as candidate high-confidence calls.
b) Next, we refined the benchmark regions to create Tier 1 (v2.1)regions for small variants by excluding the following: i) Regions spanning SVs were excluded.
ii) Non-diploid regions were excluded.
iii) Regions spanning small variants that did not meet the criteria in step a) were excluded.
iv) Repeat regions (Simple repeats, segmental duplications, short tandem repeats, and VNTRs) that were overlap with small variants not meet the criteria in step a) were excluded.
c) The candidate high-confidence small variants fully within the Tier 1 regions were defined as high-confidence small variants.
In our study design, users can benchmark their call set in custom regions and also in off-the-shelf regions.For example, users can compare variants in the relatively conservative Tier 1 regions or in more challenging repetitive regions.This customizable range of comparisons expands the utilities of the benchmark.
In the benchmarking set, to determine whether a variant belongs to a specific region, we require that the variant is 100% within that region.Users have the flexibility to adjust this parameter through the configuration file.In the query set, we also include variants in the bordering regions of a given region, such as Tier 1.This inclusion helps mitigate false negatives caused by inaccurate variant breakpoints.Users can manually verify the accuracy of variants in these border regions.When calculating false positives, variants in the bordering regions are excluded.The default size for the border regions is set at 100 bp for SV comparison, but users can also adjust this parameter.
Furthermore, to address the potential bias introduced by differences in variant lengths during comparisons, we have implemented a certain tolerance setting.For example, when comparing SVs with a length of 50 bp or greater, we include variants larger than 30 bp in both the initial query set and the benchmark set.However, when calculating True Positive, False Positive, and False Negative, we only consider variants that are 50 bp or longer.This setting ensures accurate identification of variants around the 50 bp threshold, preventing false positives and false negatives.

Phasing regions definition
To defined the phasing regions of the benchmark, we aligned the phased reads to the reference genome.The regions with read depth more than 1/3 of the whole genome sequencing depth were extracted for each haplotype.We then merged the regions of two haplotypes and only regions supported by both haplotypes were included in final phasing regions.

Variant benchmark evaluation
In this study, we utilized the reads from Illumina, HiFi, and haplotype-resolved assemblies for variant discovery, while BGI and ONT reads were used as orthogonal technologies for providing additional evidence and determining whether technologyspecific variants were included in Tier 1 or Tier 2. To obtain a high-quality variant benchmark with low false positives and high sensitivity, we applied strict filtering of the variant within each variant discovery technology and integrated the filtered variants generated by all three technologies (Illumina, HiFi, and haplotype-resolved assemblies).
For variant validation, we used more flexible criteria to generate the initial BGI and ONT calls.
For BGI sequencing, we used deepvariant to call the variants and filtered variants with allele frequencies < 0.2.The variants were then normalized using the "norm" command in bcftools.